iT邦幫忙

2023 iThome Ironman Contest

DAY 6
AI & Data

Learning PySpark in 30 Days of Messing Around series, part 6

[ Day 6 ] - PySpark | Introduction - DataFrame - Sample


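In data processing, the overview methods from the past few days give us the big picture of a dataset, but we still need to dig in and check what the individual rows look like. When the dataset is sorted, show() and limit() only return the first rows of that ordering, and filter() only returns rows matching a condition you already had in mind, so none of them gives as representative a look at the data as sample().
So let's see how to use sample()~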

Here we go!

1. sample()

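sample(withReplacement, fraction, seed)
withReplacement – whether to sample with replacement. The default is False: each row appears at most once in the result. With True, the same row can be drawn more than once. Either way, sample() returns a new DataFrame and the original data is unchanged.
fraction – the sampling fraction. Without replacement it is the probability of keeping each row and must be in [0.0, 1.0]; with replacement it is the expected number of times each row is drawn and must be >= 0. The result is not guaranteed to contain exactly fraction × (number of rows) rows.
seed – the random seed (optional). Sampling is random, so, just as when you generate random data, fixing the seed makes the result reproducible for the same data and partitioning.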
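
The difference between the two withReplacement modes can be sketched in plain Python. This is only an illustration of the semantics described above, not Spark's actual implementation; the function names `bernoulli_sample` and `poisson_sample` are made up for this example.

```python
import math
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Without replacement: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def _poisson(rng, lam):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def poisson_sample(rows, fraction, seed=None):
    """With replacement: each row is emitted a Poisson(fraction) number of times."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.extend([row] * _poisson(rng, fraction))
    return out

rows = list(range(1000))
no_repl = bernoulli_sample(rows, 0.5, seed=4)
with_repl = poisson_sample(rows, 0.5, seed=4)
print(len(no_repl), len(set(no_repl)))      # no duplicates: both numbers match
print(len(with_repl), len(set(with_repl)))  # duplicates: second number is smaller
```

In both modes the result size is only close to fraction × len(rows), not exactly equal, which is why a sample() call on a tiny DataFrame can return noticeably more or fewer rows than you expect.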

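from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; sc is its SparkContext
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(
    [
        ("drink", 2, "Carmen", 23, "Female"),
        ("movie", 2, "Juliette", 16, "Female"),
        ("write", 2, "Don José", 25, "Male"),
        ("sleep", 2, "Escamillo", 30, "Male"),
        ("play", 2, "Roméo", 18, "Male"),
        ("swim", 3, "Vivi", 18, "Female"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Sample with replacement; on average each row is drawn 0.5 times
df.sample(withReplacement=True, fraction=0.5, seed=4).show()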
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-----+----+---------+---+------+
|Thing|Hour|     Name|Age|Gender|
+-----+----+---------+---+------+
|drink|   2|   Carmen| 23|Female|
|movie|   2| Juliette| 16|Female|
|write|   2| Don José| 25|  Male|
|sleep|   2|Escamillo| 30|  Male|
| play|   2|    Roméo| 18|  Male|
| swim|   3|     Vivi| 18|Female|
+-----+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.sample(withReplacement=True, fraction=0.5, seed=4).show()
+-----+----+--------+---+------+
|Thing|Hour|    Name|Age|Gender|
+-----+----+--------+---+------+
|movie|   2|Juliette| 16|Female|
|write|   2|Don José| 25|  Male|
+-----+----+--------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

2. sampleBy()

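sampleBy(col, fractions, seed)
col – the column that defines the strata (groups) to sample from.
fractions – a dict mapping each stratum value to its sampling fraction. If a stratum is not specified, its fraction is treated as zero, so none of its rows are returned. Sampling is done without replacement.
seed – the random seed (optional). Sampling is random, so, just as when you generate random data, fixing the seed makes the result reproducible.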
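
The per-stratum behaviour can be sketched in plain Python as well. This is only an illustration of the semantics, not Spark's implementation, and `sample_by` is a hypothetical helper written for this example; its output will not match the rows Spark picks for the same seed.

```python
import random

def sample_by(rows, key, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum.

    Strata missing from `fractions` get probability 0, so they are dropped.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(key(row), 0.0)]

rows = [
    ("drink", 2, "Carmen", 23, "Female"),
    ("movie", 2, "Juliette", 16, "Female"),
    ("write", 2, "Don José", 25, "Male"),
    ("sleep", 2, "Escamillo", 30, "Male"),
    ("play", 2, "Roméo", 18, "Male"),
    ("swim", 3, "Vivi", 18, "Female"),
    ("swim", 3, "Gary", 18, "Male"),
]

# Stratify on the Hour field (index 1): about half of Hour=2, all of Hour=3
picked = sample_by(rows, key=lambda r: r[1], fractions={2: 0.5, 3: 1.0}, seed=1)
for row in picked:
    print(row)
```

A fraction of 1 keeps every row of that stratum, which is why both Hour=3 rows always survive, while the number of Hour=2 rows kept varies with the seed.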

from pyspark.sql.functions import col

# sc is the SparkContext created in the sample() example above
rdd = sc.parallelize(
    [
        ("drink", 2, "Carmen", 23, "Female"),
        ("movie", 2, "Juliette", 16, "Female"),
        ("write", 2, "Don José", 25, "Male"),
        ("sleep", 2, "Escamillo", 30, "Male"),
        ("play", 2, "Roméo", 18, "Male"),
        ("swim", 3, "Vivi", 18, "Female"),
        ("swim", 3, "Gary", 18, "Male"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Stratify on Hour: keep about half of the Hour=2 rows and all of the Hour=3 rows
df.sampleBy(col("Hour"), fractions={2: 0.5, 3: 1}, seed=1).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-----+----+---------+---+------+
|Thing|Hour|     Name|Age|Gender|
+-----+----+---------+---+------+
|drink|   2|   Carmen| 23|Female|
|movie|   2| Juliette| 16|Female|
|write|   2| Don José| 25|  Male|
|sleep|   2|Escamillo| 30|  Male|
| play|   2|    Roméo| 18|  Male|
| swim|   3|     Vivi| 18|Female|
| swim|   3|     Gary| 18|  Male|
+-----+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.sampleBy(col("Hour"), fractions={2: 0.5, 3: 1}, seed=1).show()
+-----+----+---------+---+------+
|Thing|Hour|     Name|Age|Gender|
+-----+----+---------+---+------+
|write|   2| Don José| 25|  Male|
|sleep|   2|Escamillo| 30|  Male|
| swim|   3|     Vivi| 18|Female|
| swim|   3|     Gary| 18|  Male|
+-----+----+---------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

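If anything is unclear or wrong, or if you have another approach to share, feel free to leave me a comment! If you enjoyed this post, likes and subscriptions are welcome too!

I'm Vivi, a data engineer struggling in the cloud! See you in the next post! Bye Bye~
[This post will also be published on my personal Medium. I look forward to meeting you there!]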

